mort.data <- read.delim("C:/Users/rageg/Downloads/mortality3.1.txt", header = TRUE, sep = "")

Reading in the raw data.

#Data Cleaning
mort.data <- mort.data %>%
  # Removing the Year column as it contains integers and strings, while the
  # Year-code column provides the same information but with just the year (integer)
  select(-Year, -MultipleCauseofdeathCode) %>%
  # Renaming the Year.Code column as year
  rename("Year" = "YearCode") %>%
  rename("State" = "X.State.") %>%
  rename("% Of total Deaths" = "X.ofTotalDeaths") %>%
  # Removing the 2023 column as it is unreliable and is constantly being updated.
  filter(Year != 2023)

In this Analysis, i will attempt to answer the question “Is there any discernible change in the amount of deaths linked to the use of opioid drugs over the course of the opioid crisis from 1999-2020”. I am interested in answering this question as from my recent observations of my own community, I have noticed an increase of drug addicted individuals in the state of Massachusetts, specifically Boston and the Suffolk county. Given this, I hypothesize that the regions that have the highest population locations, or cities, will experience the greatest amount of change in death, given their density and how fast opioid usage can spread. This would include major populous cities like Boston, Los Angeles, San Francisco, and the likes. To answer the question “Is there any discernible change in the amount of deaths linked to the use of opioid drugs over the course of the opioid crisis from 1999-2020”, I start by cleaning the raw data that was retrieved. I filtered for redundant columns, incomplete columns, and renamed certain columns for clarity.


#Mapping each state to a region by creating a new region column
states <- unique(mort.data$State)

state_to_region <- data.frame(
  State = states,
  Region = c(
  "Southeast", "West", "Southwest", "Southeast", "West",
  "West", "Northeast", "Northeast", "Northeast", "Southeast", "Southeast",
  "West", "West", "Midwest", "Midwest", "Midwest",
  "Midwest", "Southeast", "Southeast", "Northeast", "Northeast",
  "Northeast", "Midwest", "Midwest", "Southeast", "Midwest",
  "West", "Midwest", "West", "Northeast", "Northeast",
  "Southwest", "Northeast", "Southeast", "Midwest", "Midwest",
  "Southeast", "West", "Northeast", "Northeast", "Southeast", "Midwest",
  "Southeast", "Southeast", "West", "Northeast", "Southeast",
  "West", "Southeast", "Midwest", "West"
)
)


# Merging the original data set with the mapping
mort.data <- merge(mort.data, state_to_region, by = "State", all.x = TRUE)

For the purpose of my analysis, The region that each state belonged to was necessary. Since the original raw data did not incorporate region as a column, the above chunk manually maps each state to its respective region by creating a new data frame with the states and their region and merging that data frame with our original raw data frame through the states column. The Mort.data(raw data frame) and state_to_region data frames are merged using the “merge” function where the “by =”States” ” defines the mutual column that will merge the two frames. the This output is the general Data Frame I will be using for my analysis.

#First Plot: Total Deaths vs Year for Each state on the same graph, interactive.

totalDeaths.df <- mort.data %>%
  group_by(State, Year, Region) %>%
  summarise(TotalDeaths = sum(Deaths))

p <- ggplot(totalDeaths.df, aes(x = Year, y = TotalDeaths, color = Region, group = State)) +
  geom_line() +
  geom_point() +
  labs(title = "Total Deaths Trendline by State (1999-2020)",
       x = "Year",
       y = "Total Deaths") +
  theme_minimal()


ggplotly(p)

This first above displays the trend line of Total Deaths vs the Year for each state. As can be observed, the regions are separated by the color. If you hover over each trend line, information in regards to the state, region, Total Deaths, and Year is given. From this graphic, it can be observed that from the years 1999-2010 the slope was at a slow and steady increase, but an increase nonetheless for most of the states given by the general trend of all the lines. But from 2010-2020, notice a considerable difference at the rate of opioid related deaths increased. A perfect example is Ohio in the Midwest. As you hover over Ohio and follow its trend line on the interactive plot, notice the sudden spike of its slope from 2011 to 2020. Another such state is Florida in the Southeast, where at first there was a small decrease from 2011 to 2013 almost as if in response to an arbitrary government policy, but again the same trend as Ohio emerges from 2013 to 2020. Although this cannot be said for all of the states, there are clearly discernible changes in the amount of deaths per year from the two ranges 1999-2010 and 2011 to 2020.

#STEP 5:
# Calculate slopes for each state
slopes <- totalDeaths.df %>%
  #Gets every unique state and region duo
  group_by(Region, State) %>%
  #Arrange the Year in ascending order
  arrange(Year) %>%
  #Gets the slope from the fitment of TotalDeaths vs Year
  mutate(slope = lm(TotalDeaths ~ Year)$coefficients[2])

#Identify the state with the steepest slope within each region
max_slope <- slopes %>%
  select(-Year)%>%
  group_by(Region) %>%
  slice(which.max(slope))
  

In the above chunk,I conducted a more detailed analysis of narcotic -related deaths from 1999 to 2020. I did this by first grouping the data by Region and State using the “group_by” function, and then arranged the frame in chronological order using the “arrange” function. I then used the mutate function to calculate the slope for each of the state by fitting a linear model(lm) of totalDeaths against Year and extracted the slope from the coefficients by accessing coefficients[2], which is the slope. The second part filtered out the state witch has the maximum slope for each region. The Year column was removed as it was not necessary for analysis in this step. This is to see which state impacts the data the most for their respective regions and the state with the highest rate of drug use for each region.

selectedStates <- max_slope$State

topStates <- totalDeaths.df %>%
  filter(State %in% selectedStates)


pl3 <- ggplot(topStates, aes(x = Year, y = TotalDeaths, color = State, shape = Region)) +
  geom_line() +
  geom_point() +
  labs(title = "Total Deaths Trendline for States with steepest Slope (1999-2020)",
       x = "Year",
       y = "Total Deaths",
       color = "State") +
  theme_minimal()

ggplotly(pl3)

As can be observed from the above visual, all of the top states in each region in regards to slope follow the same trend within the ranges that was explained in the analytically hypothesis above. All of the states displayed a steady but slow slope from 1999 to 2010, but experienced a huge spike from 2011 to 2020.

#I will now choose The state with the minimum slope for each region-step 8

min_slope <- slopes %>%
  select(-Year)%>%
  group_by(Region) %>%
  slice(which.min(slope))

For contrast, using the same filtering methods, I also extracted the states with the lowest rate of increase in terms of narcotic related deaths within each region. This is also to further support my hypothesis of the discernible changes within the two time ranges (1999-2010 and 2011 to 2020).

selectedStates2 <- min_slope$State

botStates <- totalDeaths.df %>%
  filter(State %in% selectedStates2)

pl4 <- ggplot(botStates, aes(x = Year, y = TotalDeaths, color = State, shape = Region)) +
  geom_line() +
  geom_point() +
  labs(title = "Total Deaths Trendline for States with Least steep Slope (1999-2020)",
       x = "Year",
       y = "Total Deaths",
       color = "State") +
  theme_minimal()

ggplotly(pl4)

Even within the contrasted graph of the states with the lowest rate of increase, the trend remained relatively true for the ranges. Of course, given that these states already have the lowest rate of increase in their respective region, there will obviously be some states that serve as outliers in which case the hypothesis would not apply. But for Vermont in the northeast and New Mexico in the southwest, the trend remaineed true, in that the states experienced a relatively huge spike in totalDeaths vs year from 2011 to 2020 when compared to 1999 to 2010. But for the other states(Hawaii, South Dakota), there were barely noticeable changes, as explained before, given that these states are of the most extreme cases in which the change in deaths vs year is the least. For Arkansas in the Southeast, the trendline remained consistent for the most part.

breaks <- c(1999, 2010, Inf)

#Create a new variable to categorize the years into two groups
topStates <- topStates %>%
  mutate(YearGroup = cut(Year, breaks = breaks, labels = c("1999-2010", "2011-2020"), include.lowest = TRUE))

#Function to calculate lm statistics
calculate_lm_stats <- function(data) {
  lm_result <- lm(TotalDeaths ~ Year, data = data)
  return(data.frame(
    slope = coef(lm_result)[2],
    intercept = coef(lm_result)[1],
    r_squared = summary(lm_result)$r.squared
  ))
}

#Calculate lm statistics for each state and year group combination
lm_stats <- topStates %>%
  group_by(State, Region, YearGroup) %>%
  nest() %>%
  mutate(lm_stats = map(data, calculate_lm_stats)) %>%
  unnest(lm_stats)
lm_stats <- lm_stats %>%
  select(-data)
breaks <- c(1999, 2010, Inf)

# Create a new variable to categorize the years into two groups
botStates <- botStates %>%
  mutate(YearGroup = cut(Year, breaks = breaks, labels = c("1999-2010", "2011-2020"), include.lowest = TRUE))

# Function to calculate lm statistics
calculate_lm_stats <- function(data) {
  lm_result <- lm(TotalDeaths ~ Year, data = data)
  return(data.frame(
    slope = coef(lm_result)[2],
    intercept = coef(lm_result)[1],
    r_squared = summary(lm_result)$r.squared
  ))
}

# Calculate lm statistics for each state and year group combination in botStates
lm_stats_botStates <- botStates %>%
  group_by(State, Region, YearGroup) %>%
  nest() %>%
  mutate(lm_stats = map(data, calculate_lm_stats)) %>%
  unnest(lm_stats)
lm_stats_botStates <- lm_stats_botStates %>%
  select(-data)

# Display the table
#print(lm_stats_botStates)

merged_lm_stats <- dplyr::bind_rows(
  mutate(lm_stats, Type = "Top States"),
  mutate(lm_stats_botStates, Type = "Bottom States")
)


# Display the modified data frame
print(merged_lm_stats)
#> # A tibble: 20 × 7
#> # Groups:   State, Region, YearGroup [20]
#>    State        Region    YearGroup  slope intercept r_squared Type         
#>    <chr>        <chr>     <fct>      <dbl>     <dbl>     <dbl> <chr>        
#>  1 Arizona      Southwest 1999-2010  42.7    -85311.    0.964  Top States   
#>  2 Arizona      Southwest 2011-2020 160.    -320991.    0.827  Top States   
#>  3 California   West      1999-2010  90.6   -180005.    0.567  Top States   
#>  4 California   West      2011-2020 343.    -687893.    0.583  Top States   
#>  5 Florida      Southeast 1999-2010 126.    -251959.    0.954  Top States   
#>  6 Florida      Southeast 2011-2020 550.   -1104576.    0.869  Top States   
#>  7 New York     Northeast 1999-2010 107.    -214690.    0.836  Top States   
#>  8 New York     Northeast 2011-2020 538.   -1081199.    0.910  Top States   
#>  9 Ohio         Midwest   1999-2010  82.6   -164989.    0.899  Top States   
#> 10 Ohio         Midwest   2011-2020 447.    -897688.    0.778  Top States   
#> 11 Arkansas     Southeast 1999-2010  24.5    -49012.    0.919  Bottom States
#> 12 Arkansas     Southeast 2011-2020  11.8    -23615.    0.661  Bottom States
#> 13 Hawaii       West      1999-2010   3.61    -7187.    0.644  Bottom States
#> 14 Hawaii       West      2011-2020   1.08    -2120.    0.0481 Bottom States
#> 15 New Mexico   Southwest 1999-2010   6.99   -13776.    0.236  Bottom States
#> 16 New Mexico   Southwest 2011-2020  35.4    -70834.    0.667  Bottom States
#> 17 South Dakota Midwest   1999-2010   2.53    -5064.    0.791  Bottom States
#> 18 South Dakota Midwest   2011-2020   2.39    -4798.    0.330  Bottom States
#> 19 Vermont      Northeast 1999-2010   2.78    -5535.    0.256  Bottom States
#> 20 Vermont      Northeast 2011-2020  18.7    -37507.    0.907  Bottom States

In chunk above, The “mutate” function is used to create a new variable, YearGroup, based on the specified ranges. A function named “calculate_lm_stats” is defined to conduct linear regression analysis on TotalDeaths against Year within each group. This function returns a data frame containing slope, intercept, and R-squared values. The “group_by,” “nest,” and “map” functions are then utilized to group the data by State, Region, and YearGroup, nest the data for each group, and apply the “calculate_lm_stats” function, respectively. Finally, the “unnest” function is used to expand the nested lm_stats column, and unnecessary columns are removed using “select.” The resulting lm_stats data frame provides valuable insights into the linear regression statistics for each state within the specified year groups and regions. The visualization of the models was not enough to support the Hypotheses, And so I extracted the summary of the lm model for the different time range and state combination for both the states with the steepest slope and states with the least steep slopes.

As observed, for most of the states within this group, the slope more than tippled from 1999-2010 to 2011-2020, further supporting the hypothesis of discernible changes during the opioid crisis from 1999 to 2020. Furthermore, the hypothesis that the discernible change will be more noticeable for states with more populous locations is further proven. The states that were chosen for the steepest slopes within each region is also the states with either the 1st or the 2nd most population within that region,ex: New York in the Northeast, and the slope more than tippled for those states. I also extracted the states with the least steep slope from each region and as previously explained, most the states in this Data frame are not noticing any strong change in opioid usage from the two different ranges besides Arkansas, New Mexico, and Vermont, whose slop increase almost 3 times for Arkansas, almost 6 ties for New Mexico, and approximately 9 times for Vermont. Despite these states being in the extreme case of the least overall rate of change, we can still notice a huge change of rate between the two time ranges for the ones previously mentioned.